variability among recordings were similar to those noted in past research
with this type of examination. Results suggest that the impact of recording
Findings differed for the more heterogeneous population, and the total
a more heterogeneous group of examinees unless some adjustments are made to
Assessing the Impact of Standardized Patient Variability on Examination
Mastery-Level Decision Consistency Rates 

Standardized patient (SP) examinations are being used with increasing frequency by medical schools and 
testing organizations to assess the clinical skills of medical students in a range of simulated doctor-patient 
encounters (AAMC, 1998). SPs are laypersons trained to portray patients in clinical encounters (referred to as 
cases) and to record as well as rate examinee behaviors using case-specific checklists and rating scales (Barrows, 
1987). 

Although a large body of research has been dedicated to assessing the reliability of SP examination scores 
and related inferences, relatively few studies have addressed some more basic issues critical for all clinical skills 
assessments that utilize SPs as recorders of student behavior (Swanson & Norcini, 1989; Swanson, Norman, &
Linn, 1995; Vu & Barrows, 1994). One of these issues is the extent to which SP checklist recording and rating
discrepancies impact upon pass/fail decisions. This issue is of particular relevance for testing programs that train 
several SPs for each case where it must be shown that the likelihood of passing a case and/or the examination is 
unrelated to the particular cohort of patients that a student might have seen during the test. Although precautions 
can be adopted to minimize this risk (e.g., randomly assigning students to different SPs for each case and excluding 
problematic SPs), discrepancies in recording and rating accuracy could still have deleterious effects on the 
probability that a given candidate will pass the examination. 

Investigations that estimated checklist item-level recording accuracy by comparing SP responses to those of 
a scoring key have generally reported high proportion of agreement rates, ranging from the mid .70s to the upper 
.90s (De Champlain, Margolis, King, & Klass, 1997; Vu, Marcy, Colliver, Verhulst, Travis, & Barrows, 1992).
Findings reported in past generalizability studies also seem to suggest that there is little overall checklist score 
variability attributable to SPs (Swanson & Norcini, 1989; Swanson, Norman, & Linn, 1995; van der Vleuten & 
Swanson, 1990). However, it is important to point out that the small variance components estimated for the Raters 
facet (i.e., SPs) in these studies are not necessarily indicative of perfect agreement among SPs. This is especially
true in light of the inordinately large variance component typically associated with differences in performance from
case to case throughout an examination (which is indicative of the content-specific nature of performances; Linn & 
Burton, 1994). Despite these small variance components, it remains possible that SP recording discrepancies are 
important with respect to the consistency with which both scores can be rank-ordered and mastery-level decisions 
can be ascribed. 

Researchers who assessed the impact of using multiple SPs per case on the reliability estimates of overall 
case and component scores (i.e. checklist, written post-encounter note, etc.) have generally reported only modest 
effects on generalizability coefficient values (Colliver, Marcy, Vu, Steward & Robbs, 1994; Colliver, Morrison, 
Markwell, Verhulst, Steward, Dawson-Saunders & Barrows, 1990). Similarly, negligible differences were noted in 
two studies that compared station pass/fail rates across multiple SPs portraying identical cases (Colliver, Robbs, & 
Vu, 1991; Reznick, Smee, Rothman, Chalmers, Swanson, Dufresne, Lacombe, Baumer, Poldre, Levasseur, Cohen, 
Mendez, Patey, Boudreau & Berard, 1992). However, a study undertaken by De Champlain, Macmillan, Klass & 
Margolis (1999) which looked at the effect of not only using multiple SPs per case but also administration at 
multiple test sites reported a significant intra-site variability effect. Although test sites on the whole (i.e. inter-site 
variability) were comparable with respect to recording consistency, differences noted at the individual SP-level (i.e. 
intra-site variability) were of serious enough concern to warrant further attention (differences in examinee scores 
ranged from 5% - 10% as a function of the SP cohort encountered). 

In particular, assessing the impact of cross-SP variability constitutes a critical first step prior to 
(polytomous) calibration, scaling and linking efforts. Much of the research dedicated to linking performance 
assessments has focused on using extensions of equating methods originally devised for use with multiple-choice 
items (Baghi, Bent, & Delain, 1995; Baker, 1992; Clauser, Ross, Nungester, & Clyman, 1997; Cope, 1995;
Hennings & Hirsch, 1996; Huynh & Ferrara, 1994; Kim & Cohen, 1995; Sykes, Yen, & Ito, 1996; Tzou, 1996).
Huynh & Ferrara (1994) found the partial-credit model (Masters, 1982) to be useful for equating performance 
assessments that were moderately difficult and homogeneous, with respect to content. Tzou (1996) reported that 
polytomous IRT and linear equating procedures provided comparable results with writing assessment data. The
simulation study conducted by Fitzpatrick and Yen (1999), using a 2-PL partial credit model, also provided useful
guidelines to practitioners in terms of sample size, test length and reliability requirements. However, as pointed out 
by Tate (1999) in a recent simulation study, the use of most of these methods is predicated upon the assumption that 
judges’ level of severity in rating performances is comparable from form-to-form. Although this assumption might 
be plausible for the scoring of simpler essays with analytic keys, differences in stringency noted with more complex 
performance-based assessments, such as those routinely found in medical education, make this hypothesis tentative 
at best. The chief concern, as it pertains to equating within a licensure framework, is to ensure that classification of 
examinees as masters or nonmasters is accurate, especially for those test takers that are in the vicinity of the 
cutscore. That is, the equating process must yield scores that reflect underlying abilities of examinees with the 
smallest amount of estimation error. 

The purpose of the present research was therefore to estimate the extent to which recording variability 
among SPs impacts upon classification consistency with data sets simulated to reflect performances on a large-scale 
clinical skills examination. Specifically, the conditions modeled were intended to approximate those that might 
occur with SP testing as part of the United States Medical Licensing Examination (USMLE) with the following two 
populations: 

- United States Medical Graduates (USMGs) only (homogeneous population with respect to clinical skill 
level); 

- A combination of International Medical Graduates (IMGs) and USMGs (heterogeneous population with 
regard to clinical skill level). 

In addition to this baseline condition, classification consistency was also ascertained with data sets that were
simulated to reflect SP recording variability by respectively adding 5% and 10% to the total examination scores
simulated with the above described homogeneous and heterogeneous samples of simulees. The addition of 5% and
10% to simulated expected percent-correct (EPC) scores was meant to reflect errors of commission (“giving the
benefit of the doubt to examinees”), which are more common with SP tests than errors of omission (Vu, Marcy,
Colliver, Verhulst, Travis, & Barrows, 1992). This precursory research is essential in determining whether SP (or
even site)-related adjustments will be necessary in subsequent calibration, scaling and linking processes.



Methods 



Description of the NBME Standardized Patient Testing Program 

The National Board of Medical Examiners’ (NBME) SP examination is designed to assess the clinical 
skills of candidates for licensure who are about to enter their first postgraduate year. Examinees rotate through a 
series of clinical scenarios, or cases, and are evaluated on their ability to handle the case using their history-taking 
(Hx), physical examination (PE), communication (CM) and interpersonal (IP) skills. The first three skills are 
assessed using case-specific checklists. Two of the three skills are typically assessed per case. Checklists are 
composed of no more than 25 dichotomously-scored items which indicate whether or not a student has completed a 
specific task. Interpersonal skills are assessed using a six-item inventory that is identical across cases and scored on 
a five-point Likert scale. The checklist and inventory are completed by the SP after each 15-minute
examinee-patient encounter. Also, each examinee completes an open-ended case-specific post-encounter note following each
encounter with the SP which contains a list of their significant positive and negative findings from the encounter. 
Percent-correct scores, corresponding to the number of points obtained by the examinee out of the total number of 
available points, are currently reported for three components (checklist, interpersonal inventory and post-encounter 
note) of the SP test. Additionally, a composite case percent-correct score, corresponding to the mean of the latter 
three scores, is provided to examinees. It is important to note that the NBME SP test is currently a large-scale 
research project being considered for inclusion into the USMLE within the next five years. 

Data generation model 

Initially, proficiencies were randomly generated from a N(0,1) distribution for 100,000 simulees. These 
values were treated as the “true” proficiency estimates of simulees and denoted by $\theta_i$. Examinee observed scores
were generated by adopting the basic tenet of classical test theory, i.e., any observed score can be decomposed
into a true score component and an independent measurement error term such that

where

$r_j$ = the correlation between case $j$ and the latent trait (the “discrimination” parameter);

$\delta_{ij}$ = a random normal deviate corresponding to the error term associated with examinee $i$'s response to case $j$.

Then, a logistic transformation was applied to the simulated observed scores to bound them on a [0,1] interval using 
the following model 



where $t_j$ is item $j$'s threshold value (difficulty), which relates to a classical estimate of difficulty (proportion
correct, or p-value) as follows:

(2)

where

The latter function is equivalent to the following common IRT model

$Z = a x_{ij} + d$. (4)

McDonald (1985) has shown that function $Z$, as outlined in equation (4), can be reparameterized as

where $\Phi^{-1}$ corresponds to the inverse normal distribution. Also, based on the well known assumption that the IRT
discrimination parameter ($a$) relates to $r_j$ in the following fashion

$Z$ in equation (5) can be rewritten as

Since $a = 1$ in our function (equation (8)), $Z$ reduces to the following,

$Z = x_{ij} + t_j \sqrt{2}$. (9)
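
The step from equation (4) to equation (9) can be traced under the usual normal-ogive reparameterization. The intermediate expressions below are a reconstruction under that assumption rather than a transcription of the original equations; in particular, the form of the intercept $d$ and the reading of $t_j$ as $\Phi^{-1}(p_j)$, with $p_j$ the proportion correct for case $j$, are assumed.

```latex
\begin{align*}
a &= \frac{r_j}{\sqrt{1 - r_j^{2}}}, &
d &= \frac{t_j}{\sqrt{1 - r_j^{2}}}, &
t_j &= \Phi^{-1}(p_j) \\[4pt]
a = 1 \;&\Rightarrow\; r_j^{2} = 1 - r_j^{2}
      \;\Rightarrow\; r_j = \frac{1}{\sqrt{2}}, &
\sqrt{1 - r_j^{2}} &= \frac{1}{\sqrt{2}} \\[4pt]
d &= t_j \sqrt{2}, &
Z &= a\,x_{ij} + d = x_{ij} + t_j \sqrt{2}
\end{align*}
```

Under these assumptions the factor $\sqrt{2}$ in equation (9) follows directly from fixing $a = 1$.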

The probability of a correct response by examinee i on case j was then estimated by substituting function Z 
(equation (9)) into equation (2). The expected percent-correct score (EPC) on the total (12-case) examination was 
subsequently obtained by calculating the mean probability of a correct response across cases for examinee i (and 
multiplying by 100). 
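
The generation steps described above can be sketched in Python roughly as follows. This is an illustrative sketch, not the authors' code. The decomposition $x_{ij} = r_j \theta_i + \sqrt{1 - r_j^2}\,\delta_{ij}$, the plain logistic form $1 / (1 + e^{-Z})$ and the threshold $t_j = \Phi^{-1}(p_j)$ are assumptions standing in for equations that are not legible in the source. The function name `simulate_epc`, the seed, and the use of the same difficulty and discrimination for all 12 cases are likewise hypothetical.

```python
import numpy as np
from statistics import NormalDist


def simulate_epc(n_simulees=100_000, n_cases=12, p=0.68, r=0.50, seed=0):
    """Return (theta, epc): true proficiencies and expected percent-correct scores."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(n_simulees)          # "true" proficiencies, N(0, 1)
    delta = rng.standard_normal((n_simulees, n_cases))  # independent error deviates

    # Assumed classical decomposition: true-score part plus independent error.
    x = r * theta[:, None] + np.sqrt(1.0 - r**2) * delta

    t = NormalDist().inv_cdf(p)                      # assumed threshold from the p-value
    z = x + t * np.sqrt(2.0)                         # equation (9)
    prob = 1.0 / (1.0 + np.exp(-z))                  # assumed logistic transformation

    epc = 100.0 * prob.mean(axis=1)                  # mean over cases, as a percentage
    return theta, epc
```

Calling `simulate_epc()` with its defaults mirrors the 100,000-simulee, 12-case design; the resulting mean and standard deviation will not necessarily match the values reported in the Results, since the case-level parameters used in the study are not given here.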

Two additional EPC scores were estimated for each simulee. The first additional score simply entailed 
adding 5% to the expected percent-correct score initially simulated (baseline condition). The second additional EPC score was
obtained by adding 10% to the measure simulated in the baseline condition. The latter two scores were intended to 
reflect those that might be obtained when “moderate” and “extreme” errors of commission are noted between two 
cohorts of SPs assigned to the same set of cases. 

Case parameters and test length 

Case difficulty and discrimination values selected for the six data sets were similar to those reported in 
Gessaroli, Swanson, & De Champlain (1998) and reflect those typically encountered with SP examinations. Mean 
case difficulty and discrimination parameter values were respectively equal to .68 (i.e. 68%) and .50. These values 
were used to initially simulate an expected percent-correct score for each of the 12 cases using equation (2). Then, 
the mean expected percent-correct score (across the 12 cases) was treated as the overall expected percent-correct 
score in all analyses. 

Pass/fail cutoff values 

True proficiency cutoff scores were set at $\theta$ values of -1.64 and -0.84. These cutoff score values result in
respectively failing 5% and 20% of simulees. The first failure rate (5%) is that typically encountered with USMG-like
SP examinees whereas the second rate (20%) is characteristic of a more heterogeneous population including
both IMGs and USMGs.

EPC pass/fail values were derived by estimating the score corresponding to the 5th and 20th percentile in the
distribution of observed scores generated. One EPC cutoff score value was estimated for each of the following two 
conditions: 

- Baseline/homogeneous proficiency condition; 

- Baseline/heterogeneous proficiency condition. 
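
A brief sketch of how both kinds of cutoff might be obtained. The proficiency cutoffs are standard normal quantiles for the two failure rates, and the EPC cutoffs are the matching empirical percentiles of the simulated baseline scores. The helper names `theta_cutoff` and `epc_cutoff` are hypothetical, and NumPy's default (linear interpolation) percentile rule is an assumption, since the source does not say how percentiles were computed.

```python
import numpy as np
from statistics import NormalDist


def theta_cutoff(fail_rate):
    """Standard normal quantile below which the given proportion of simulees falls."""
    return NormalDist().inv_cdf(fail_rate)


def epc_cutoff(epc_scores, fail_rate):
    """Empirical percentile of the baseline EPC scores for the same failure rate."""
    return float(np.percentile(epc_scores, 100.0 * fail_rate))


# Failure rates of 5% and 20% reproduce the proficiency cutoffs quoted in the text.
print(round(theta_cutoff(0.05), 2), round(theta_cutoff(0.20), 2))  # -1.64 -0.84
```

The EPC cutoff is computed once per population from the baseline scores and then held fixed when the 5% and 10% shifts are applied.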

Analyses 

For each of the following six data sets, false positive, false negative and total misclassification rates were 
computed: 

- Baseline/homogeneous proficiency condition; 

- Baseline/heterogeneous proficiency condition; 

- Moderate SP rater discrepancies/homogeneous proficiency condition; 

- Moderate SP rater discrepancies/heterogeneous proficiency condition; 

- Extreme SP rater discrepancies/homogeneous proficiency condition; 

- Extreme SP rater discrepancies/heterogeneous proficiency condition. 

A false positive decision occurs when a true nonmaster (based on their $\theta$ estimate) passes the examination
based on their EPC score. Conversely, failing a true master based on their EPC score will result in a false negative 
decision. For the purposes of this study, the false positive rate corresponds to the number of false positives out of 
the total number of simulees (100,000). Similarly, the false negative rate corresponds to the number of false 
negative decisions out of the total number of simulees (100,000). 
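
The definitions above translate into a short computation, sketched here under stated assumptions. The function name `misclassification_rates` and its `shift` argument are hypothetical, and treating a score exactly equal to the cutoff as a pass is an assumption, as the source does not specify how ties are handled.

```python
import numpy as np


def misclassification_rates(theta, epc, theta_cut, epc_cut, shift=0.0):
    """False positive, false negative and total rates, each out of all simulees."""
    theta = np.asarray(theta, dtype=float)
    observed = np.asarray(epc, dtype=float) + shift  # shift mimics errors of commission

    true_master = theta >= theta_cut                 # classification on true proficiency
    passed = observed >= epc_cut                     # classification on (shifted) EPC score

    fp = float(np.mean(~true_master & passed))       # true nonmaster who passes
    fn = float(np.mean(true_master & ~passed))       # true master who fails
    return fp, fn, fp + fn
```

For the moderate and extreme conditions the same baseline `epc_cut` would be reused with `shift=5.0` or `shift=10.0`, so that only the scores move and not the passing standard.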

Results 

Descriptive statistics for expected percent-correct baseline condition data set 

Mean and standard deviation EPC score values were respectively equal to 67.86% and 14.21%. 

Cronbach’s alpha for the simulated 12-case test scores was equal to 0.78 which is typical for scores derived from 
this type of performance assessment. Similar values were reported with several past NBME SP prototype data sets
(Klass, De Champlain, Fletcher, King, & Macmillan, 1998).

Classification consistency results

False positive, false negative and total misclassification rates for data sets simulated to reflect a 
homogeneous (USMG-like) population for the baseline, moderate, and extreme rater discrepancy conditions are 
outlined in Table 1. Overall misclassification rates ranged from 3.7% (moderate rater discrepancy condition) to 
4.1% (baseline & extreme rater discrepancy conditions). False positive rates ranged from 1.8% (baseline condition) 
to 3.9% (extreme rater discrepancy condition) whereas false negative rates varied from 0.2% (extreme rater 
discrepancy condition) to 2.3% (baseline condition). 

Table 2 provides false positive, false negative and total misclassification rates for data sets generated to 
reflect a more heterogeneous (USMG/IMG-like) population with regard to clinical skill level for the baseline, 
moderate, and extreme rater discrepancy conditions. Total misclassification rates varied from 10.4% (moderate 
rater discrepancy condition) to 13.0% (baseline condition). False positive rates ranged from 2.8% (baseline 
condition) to 10.0% (extreme rater discrepancy condition) whereas false negative rates varied from 1.5% (extreme 
rater discrepancy condition) to 10.2% (baseline condition). 
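
As a quick arithmetic check on the rates just reported, the sketch below confirms that each total equals the sum of the false positive and false negative rates and computes the spread of totals within each population. The figures are those of Tables 1 and 2; the names `REPORTED`, `totals_consistent` and `total_range` are hypothetical.

```python
# (false positive, false negative, total) in percent, as reported in Tables 1 and 2.
REPORTED = {
    ("homogeneous", "baseline"): (1.8, 2.3, 4.1),
    ("homogeneous", "moderate"): (3.0, 0.7, 3.7),
    ("homogeneous", "extreme"): (3.9, 0.2, 4.1),
    ("heterogeneous", "baseline"): (2.8, 10.2, 13.0),
    ("heterogeneous", "moderate"): (6.0, 4.4, 10.4),
    ("heterogeneous", "extreme"): (10.0, 1.5, 11.5),
}


def totals_consistent(rates, tol=0.05):
    """True if every total matches false positive + false negative within rounding."""
    return all(abs(fp + fn - total) <= tol for fp, fn, total in rates.values())


def total_range(rates, population):
    """Largest minus smallest total misclassification rate for one population."""
    totals = [total for (pop, _), (_, _, total) in rates.items() if pop == population]
    return max(totals) - min(totals)
```

With these values the totals are internally consistent, and the spread of total rates is about 0.4 percentage points for the homogeneous population and about 2.6 for the heterogeneous one.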

Discussion 

Performance assessments have been incorporated into local as well as large-scale testing programs with 
increasing frequency over the past decade. Alternative assessments are potentially promising in that certain 
proficiencies that are difficult (if not impossible) to measure via conventional means might be more readily targeted. 
Nonetheless, the psychometric properties of performance assessment scores still clearly need to be evaluated to 
ensure that measures reported to students are accurate representations of their proficiency level on the trait of 
interest and to preclude ill-informed inferences based on those test scores. The issue of rater comparability is 
especially critical within the realm of more complex performance-based assessments traditionally found in medical
education, such as standardized patient examinations. Recording and rating variability across SPs portraying the 
same clinical scenario is an issue that is of central concern for this type of examination given the implications for 
calibration, scaling, equating and score validity, more generally.

The purpose of this study was therefore to estimate the extent to which recording variability among SPs
impacted upon classification consistency with data sets simulated to reflect performances on a large-scale clinical 
skills examination. Conditions that were simulated, in terms of the characteristics of percent-correct score 
distributions, overall failure rate, and amount of variability among recordings were similar to those noted in past 
research with this type of examination. 

Results suggest that the impact of recording discrepancies on overall misclassification rates is minimal for 
a (homogeneous) population resembling first-time USMGs. Misclassification rates varied by 0.4%, at most. These 
results are largely attributable to the fact that the cut-score was set at a value that was considerably lower than the
mean ability level of this group. Consequently, a very large inter-SP rater variability effect would be needed to
yield a substantial shift in decision consistency rates (exceeding the 10% shift instituted in this study). As expected,
false positive rates increased as errors of commission became more severe whereas the addition of 10% to the
simulated EPC scores resulted in a virtually nil false negative rate.

Findings differed, however, for the more heterogeneous population of simulees, where the total
misclassification rate was 2.6 percentage points higher in the baseline condition (13.0%) than in the moderate
rater discrepancy condition (10.4%). This result was anticipated as adding 10% shifts
the EPC score distribution by approximately one standard deviation. It is important to also point out that false 
positive rates increased significantly as errors of commission became more severe (by more than 7% across rater 
discrepancy conditions). Given the purpose of this type of examination (medical licensure), the classification error 
that is most important to minimize is the false positive rate. This is consistent with favoring a policy that would be 
aimed at protecting the public from (potentially) incompetent physicians. The results outlined in this study suggest 
that errors of commission similar to those simulated in the current investigation could have dire consequences on the 
false positive rate for a more heterogeneous group of examinees unless certain adjustments are instituted to account 
for SP rater variability. 

In terms of calibration, scaling and equating tasks, these findings suggest that the impact of errors of 
commission would probably be modest for a USMG-like population. However, treating each case as identical for 
calibration, scaling and equating purposes, irrespective of the specific SP portraying the clinical scenario, is perhaps
ill-advised for a more heterogeneous population given the divergences noted in decision consistency across rater
discrepancy conditions. 

Minimizing intra-site variability should be an important concern for all examinations that use multiple SPs 
per case so as to ensure that scores reported to students accurately reflect their true clinical skill level. Several 
methods can be adopted to reduce intra-site SP variability. From a training perspective, a periodic review of 
videotaped encounters would seem advisable to ensure that SPs are maintaining a high level of recording accuracy 
across extended periods of time. Additionally, assigning an alternate SP to monitor all encounters might also 
significantly reduce intra-site variability by including a supervisory element in the process. Finally, deriving a 
checklist score that is based on the consensus reached by a pair of SPs as to what constituted the actions undertaken 
by a student in a given encounter might also prove to be a worthwhile strategy to increase the reliability of scores. 

Treating each SP-case interaction as a unique and distinct case might also constitute another solution to the 
problem of SP rater variability in the event that remedial training cannot be offered during the course of the 
administration. For example, a headache case portrayed by SP #1 might be treated as distinct from the same clinical
scenario as depicted by SP #2. This approach is appealing in that the psychometric characteristics (e.g. difficulty 
and discrimination) of both the case and SP portraying it can be modeled into the calibration and subsequent scaling 
processes. The disadvantage of implementing such a design is that it can lead to a very sparse data matrix. This is 
especially true within the context of a national administration where several SPs portray each case at any of up to 
several dozen test sites. In this instance, the use of a traveling cohort of SPs across administration sites might be 
worthy of future consideration since they would provide the links needed to undertake concurrent calibrations and 
subsequent scaling analyses. 

Although informative to medical educators involved in SP testing, the findings reported in this study 
should be interpreted with some degree of caution given the limited nature of the simulations. Future research 
should be aimed at replicating this research with a wider array of conditions and testing scenarios. Also, the impact 
of SP variability on parameter estimation needs to be clearly ascertained. 

Results of the present research are of great use to the present testing program. It is hoped that these results
will lead to further investigations in an area that is central to not only SP tests but also to all examinations that entail
the recording and rating of behavior by human raters.

Table 1

Misclassification errors for homogeneous proficiency condition by rater discrepancy level

                         Rater discrepancy
Classification error     None (baseline)    + 5% (moderate)    + 10% (extreme)
False positive rate      1.8%               3.0%               3.9%
False negative rate      2.3%               0.7%               0.2%
Total                    4.1%               3.7%               4.1%

Table 2

Misclassification errors for heterogeneous proficiency condition by rater discrepancy level

                         Rater discrepancy
Classification error     None (baseline)    + 5% (moderate)    + 10% (extreme)
False positive rate      2.8%               6.0%               10.0%
False negative rate      10.2%              4.4%               1.5%
Total                    13.0%              10.4%              11.5%